DISCUSSION OF: TREELETS - AN ADAPTIVE MULTI-SCALE
BASIS FOR SPARSE UNORDERED DATA

By Peter J. Bickel and Ya'acov Ritov

University of California and The Hebrew University of Jerusalem

We divide our comments on this very interesting paper into two parts, following its own structure:

1. The use of treelets in connection with the correlation matrix of $X = (X_1, \dots, X_p)^T$, for which we have $n$ i.i.d. copies, or as the authors refer to it, "unsupervised learning."

2. The use of treelets as a step in best fitting the linear regression of $X_1$ on $(X_2, \dots, X_p)$.

1. Unsupervised learning. The authors' emphasis is on the method as a useful way of representing data analogous to a wavelet representation where $X = X(t)$ with $t$ genuinely identified with a point on the line and observation at $p$ time points, but where the time points have been permuted.

As such, this can be viewed as a clustering method which, from their examples, gives very reasonable answers. However, to make more general theoretical statements and to permit comparison to other methods, they necessarily introduce the model

(1)    $X = \sum_{j=1}^{K} U_j v_j + \sigma Z,$

where $U = (U_1, \dots, U_K)$ is an unobservable vector, the $v_j$ are fixed unknown vectors, and $Z \sim N_p(0, I_p)$, where $I_p$ is the identity, $N_p$ is the $p$-dimensional Gaussian distribution, and $U$, $Z$ are independent.

At this point, we are a bit troubled by the authors' analysis. We believe a
key point, that is only stressed implicitly by the authors, is that the
population tree structure, as defined, is only a function of the population
covariance matrix. This is clear at Step 1, and follows since the Jacobi
transformations depend only on the covariance and variances of the
coordinates involved.
This raises a problematic issue. If $U$, and hence $X$, has a Gaussian
distribution, then the structure as postulated in (1) is not identifiable, as
is known in factor analysis. Consider, for instance, Example 2. If we redefine
$\tilde U_j = U_j$, $j = 1, 2$, $\tilde v_3 = c_1 v_1 + c_2 v_2$, and
$\tilde U_3 = 0$, we are at the same covariance matrix as in (19) with only
two nonoverlapping blocks.
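
The general phenomenon can be checked numerically. The following is a minimal sketch under illustrative assumptions: the loading matrix `V`, the noise level `sigma` and the rotation `Q` are hypothetical choices, not quantities from the paper under discussion. When the factors are Gaussian with identity covariance, rotating the loadings leaves the covariance of $X$ unchanged, so no function of the covariance alone can distinguish the two sets of loadings.

```python
import numpy as np

def factor_covariance(V, sigma):
    """Covariance of X = V U + sigma Z when Cov(U) = I and Cov(Z) = I."""
    p = V.shape[0]
    return V @ V.T + sigma**2 * np.eye(p)

# Hypothetical loadings: two disjoint blocks in p = 6 coordinates.
V = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]], dtype=float)
sigma = 0.5

# Any orthogonal Q yields an equally valid set of loadings V @ Q.
theta = np.pi / 5
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
V_rot = V @ Q

cov_original = factor_covariance(V, sigma)
cov_rotated = factor_covariance(V_rot, sigma)

# Same covariance, although the rotated loadings no longer have disjoint supports.
same_covariance = np.allclose(cov_original, cov_rotated)
rotated_is_dense = bool(np.all(np.abs(V_rot) > 1e-8))
```

Here the block-structured loadings and the fully dense rotated loadings are indistinguishable at the level of second moments.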

The treelets transform evidently gives a decomposition attuned to the 
authors' beliefs of a block diagonal population structure with high intrablock 
correlation. But the theoretical burden of exhibiting classes of covariance 
matrices, other than ones whose eigenvectors are not only orthogonal but 
have disjoint support, and for which some version of sparse PCA cannot be 
utilized just as well, remains. 

This is an insurmountable problem for any population parameter which 
is a function only of the covariance matrix. 

A second difficulty, special to the treelets parameter $T(\Sigma)$, is that it
is not defined uniquely for $\Sigma$ for which the maximal off-diagonal
correlation is not uniquely attained. This is reflected in the authors'
discussion in Section 3.1 of the possible instability of the empirical tree.
In this context, we don't understand their statement that inferring
$T(\Sigma)$ is not the goal. If not, what is?
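
The tie problem is easy to exhibit. The sketch below is an illustration only, not the authors' implementation: the function names are hypothetical, and the step shown is the usual description of one treelet level (pick the most correlated pair, then apply a Jacobi rotation that decorrelates it). For an equicorrelated covariance every pair attains the maximum, so the first merge, and hence the tree, depends on an arbitrary tie-break.

```python
import numpy as np

def most_correlated_pairs(cov, tol=1e-12):
    """All index pairs (a, b), a < b, attaining the maximal absolute correlation."""
    d = np.sqrt(np.diag(cov))
    corr = np.abs(cov / np.outer(d, d))
    rows, cols = np.triu_indices_from(corr, k=1)
    vals = corr[rows, cols]
    best = vals.max()
    return [(int(a), int(b)) for a, b, v in zip(rows, cols, vals) if best - v <= tol]

def jacobi_rotate(cov, a, b):
    """Rotate coordinates a and b so that their covariance becomes zero."""
    theta = 0.5 * np.arctan2(2.0 * cov[a, b], cov[a, a] - cov[b, b])
    c, s = np.cos(theta), np.sin(theta)
    R = np.eye(cov.shape[0])
    R[a, a], R[a, b], R[b, a], R[b, b] = c, -s, s, c
    return R.T @ cov @ R

# Equicorrelated covariance: all three pairs tie for the maximal correlation.
cov = np.array([[1.0, 0.5, 0.5],
                [0.5, 1.0, 0.5],
                [0.5, 0.5, 1.0]])
ties = most_correlated_pairs(cov)

# Two admissible first steps that lead to different trees.
after_01 = jacobi_rotate(cov, 0, 1)
after_02 = jacobi_rotate(cov, 0, 2)
```

Both `after_01` and `after_02` are legitimate outputs of the first level, which is why the population tree is not a well-defined function of the covariance in such cases.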

This issue makes comparison to the other methods difficult. As they state,
any of the several methods for sparse PCA, for example, d'Aspremont et al.
(2007) or Johnstone and Lu (2008), would yield the same answer as theirs for
their Example 1.

But is there a way of proceeding which teases out explicitly structures such
as in (19) without limiting oneself to the covariance matrix? Suppose that
we can write $U = Be$, where $e = (e_1, \dots, e_K)$ is a vector of
independent, not necessarily identically distributed variables, such that at
most one of them is Gaussian. That is, we assume the factor loadings
themselves are obtained structurally. Then we can write, for
$i = 1, \dots, n$, $j = 1, \dots, p$,
$X_{ij} = \sum_{l=1}^{K} C_{jl} e_{il} + \sigma Z_{ij}$,
where $C = [C_{jl}]$ is a $p \times K$ matrix, the $Z_{ij}$ are i.i.d.
$N(0, 1)$, and the $e_i = (e_{i1}, \dots, e_{iK})^T$ are independent as above.
Here, $C = VB$, where $V = (v_1, \dots, v_K)$. We conjecture that if
$p, n \to \infty$ with $K$ fixed, and the columns of $C$ are sparse, we can
recover $C$ up to a scale multiple of each row, and a permutation of the
columns. Work on this conjecture is in progress.
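
Why non-Gaussianity helps can be seen in a small simulation. This is a sketch of the standard independent-component idea under illustrative assumptions (Laplace sources, a 45-degree rotation, the sample size); it is not the recovery procedure conjectured above. A rotation of independent non-Gaussian sources leaves the covariance essentially unchanged but alters the fourth-order structure, so information beyond the covariance can separate the two representations.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Independent, unit-variance, non-Gaussian sources (Laplace is an illustrative choice).
e = rng.laplace(scale=1 / np.sqrt(2), size=(n, 2))

# Rotate the sources by 45 degrees.
theta = np.pi / 4
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
mixed = e @ Q.T

def excess_kurtosis(x):
    """Column-wise sample excess kurtosis (zero for a Gaussian)."""
    z = (x - x.mean(axis=0)) / x.std(axis=0)
    return (z**4).mean(axis=0) - 3.0

# Second moments barely move under the rotation ...
cov_gap = np.abs(np.cov(e, rowvar=False) - np.cov(mixed, rowvar=False)).max()
# ... but the kurtosis of each coordinate drops (about 3 before, about 1.5 after).
kurt_sources = excess_kurtosis(e)
kurt_mixed = excess_kurtosis(mixed)
```

With Gaussian sources both kurtosis vectors would be near zero and the rotation would be invisible, which is the non-identifiability discussed earlier.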

2. Supervised learning. Can we select variables based on the $X$, the
predictor variables, themselves? The tempting answer is yes (e.g., using PCA).
The theoretical answer is no ($Y$ can be a function of each component).
The practical answer is at most a cautious yes; cf. Cook (2007) for a
recent discussion. However, one should be careful to justify working with the
predictors without the $Y$, since current regression methods permit one to
handle models with almost exponentially many variables.

The LASSO type of estimator can handle sparse models. However, spar- 
sity is an elusive property, since the LASSO can deal with sparsity in a 
given basis, while a sparse representation may exist only in some other ba- 
sis. Treelets are proposed as a method which enriches the description of the 
model, and gives the user an over-rich collection of vectors which span the 
Euclidean space. Hopefully the tree cluster features are rich enough so the 
model can be approximated by the linear span of relatively few, say, no more 
than o(n/ log n) terms. 
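
The dependence of sparsity on the basis can be made concrete. The sketch below uses hypothetical dimensions and a hypothetical coefficient vector: a vector that is constant on a block has many nonzero coordinates in the given basis, but a single nonzero coefficient in an orthonormal basis that contains the normalized block indicator, which is the kind of vector a tree of cluster features would supply.

```python
import numpy as np

p, block = 8, 6
beta = np.zeros(p)
beta[:block] = 1.0  # constant on a block of 6 coordinates

# Number of nonzero coordinates in the given (standard) basis.
nonzeros_standard = int(np.count_nonzero(beta))

# Hypothetical alternative basis: orthonormalize [block indicator, e_2, ..., e_p].
indicator = np.zeros(p)
indicator[:block] = 1.0
M = np.column_stack([indicator, np.eye(p)[:, 1:]])
basis, _ = np.linalg.qr(M)  # first column is +/- indicator / sqrt(block)

# Coefficients of beta in the new orthonormal basis.
coef = basis.T @ beta
nonzeros_new_basis = int(np.sum(np.abs(coef) > 1e-10))
reconstruction_ok = np.allclose(basis @ coef, beta)
```

A penalty applied to `coef` rather than to `beta` would see a one-term model, while the same penalty in the original coordinates sees six terms.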

The suggested algorithm deals with complexity by serial optimization in a 
fashion similar to standard model selection methods (e.g., forward selection), 
boosting, etc. It is not clear to us why the authors select the variables from 
one level and not from their union, since again modern methods can deal 
with any polynomial number of regressors. 

To assess the performance of the algorithm, we considered a simple version of
the authors' supervised errors-in-variables model, but in an asymptotic
setting. Suppose we observe $n$ i.i.d. replicates from the distribution of
$(Y, X_1, \dots, X_p)$, where $p = p_n$ and

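    $Y = \beta Z + \varepsilon,$

    $X_i = c_p Z + \eta_i, \qquad i = 1, \dots, p,$

where $\varepsilon, Z \sim N(0, 1)$, $\eta_i \sim N(0, \sigma_i^2)$, all independent. This is a classical errors-in-variables model, where the $X_i$ are independent observations on $Z$ and the best predictor is given by

    $E(Y \mid X_1, \dots, X_p) = \beta \, \dfrac{c_p \sum_{i=1}^{p} \sigma_i^{-2} X_i}{1 + c_p^2 \sum_{i=1}^{p} \sigma_i^{-2}}.$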

Consider first $c_p = p^{-1/2}$, with all $\sigma_i = 1$ and, in particular,
$c_p^2 \sum_{i=1}^{p} \sigma_i^{-2} = 1$. In this case all variables are
interesting, and have the same weight for prediction. However, the covariance
matrix of $X$ has all diagonal terms greater than 1, and all off-diagonal
terms are $p^{-1}$. This model is not sparse, for instance in the sense of
El Karoui (2008), and is also inaccessible to regularized covariance
estimation. The Treelet Algorithm will not be able to find this term. This
model is significantly different from the null, and a consistent predictor
exists given known parameter values. However, no standard general purpose
algorithm will be able to deal with this model. A small set of simulations
shows that, in fact, there is a range of values of $c_p$ for which PCA works
better than treelets. However, for larger values of $c_p$, treelets work
surprisingly well.
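
The regime $c_p = p^{-1/2}$ can be explored with a short simulation. This is a sketch under stated assumptions and not the simulation reported above: a common noise variance `sigma2 = 1`, the dimensions, and the function names are illustrative, and the conditional mean of $Z$ is the standard Gaussian formula with known parameters. It shows that pairwise correlations are only $1/(p+1)$, so a pairwise-correlation search has little to find, while the simple sum of all coordinates still carries information about $Z$.

```python
import numpy as np

def population_covariance(p, c_p, sigma2=1.0):
    """Cov(X) for X_i = c_p Z + eta_i, Z ~ N(0, 1), eta_i ~ N(0, sigma2), independent."""
    return sigma2 * np.eye(p) + c_p**2 * np.ones((p, p))

def conditional_mean_of_Z(X, c_p, sigma2=1.0):
    """E[Z | X] with known parameters; rows of X are observations."""
    p = X.shape[1]
    return (c_p * X.sum(axis=1) / sigma2) / (1.0 + c_p**2 * p / sigma2)

p = 400
c_p = p ** -0.5
cov = population_covariance(p, c_p)

# Pairwise correlation between any two coordinates: (1/p) / (1 + 1/p) = 1 / (p + 1).
corr_offdiag = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])

rng = np.random.default_rng(1)
n = 5000
Z = rng.standard_normal(n)
X = c_p * Z[:, None] + rng.standard_normal((n, p))

# With known parameters, the conditional mean uses all coordinates with equal weight.
Z_hat = conditional_mean_of_Z(X, c_p)
corr_with_Z = np.corrcoef(Z, Z_hat)[0, 1]  # theoretical value is 1 / sqrt(2)
```

No individual pair of coordinates stands out, yet the equally weighted combination recovers a substantial part of $Z$.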

The restriction to a basis of a relatively small collection of transform
variables is a limitation. In Bickel, Ritov and Tsybakov (2008) a general
methodology was suggested for construction of a rich collection of basis
functions. Formally, we consider the following hierarchical model selection
method. For a set of functions $\mathcal{F}$ with cardinality
$|\mathcal{F}| > K$, let $\mathcal{MS}_K$ be some procedure to select $K$
functions out of $\mathcal{F}$. We denote by $\mathcal{MS}_K(\mathcal{F})$ the
selected subset of $\mathcal{F}$, $|\mathcal{MS}_K(\mathcal{F})| = K$,
$K = n^{\gamma}$ for some $\gamma < \infty$. Define $f \oplus g$ to be the
operator combining two base variables, for instance, multiplication.
The procedure is defined as follows:



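(i) Set $\mathcal{F}_0 = \{X_1, \dots, X_p\}$.

(ii) For $m = 1, 2, \dots$, let
     $\mathcal{F}_m = \mathcal{F}_{m-1} \cup \{ f \oplus g : f, g \in \mathcal{MS}_K(\mathcal{F}_{m-1}) \}$.

(iii) Continue until convergence is declared. The output of the algorithm
is the set of functions $\mathcal{MS}_K(\mathcal{F}_m)$ for some $m$.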
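
A minimal sketch of such a hierarchical procedure follows. Several ingredients are assumptions made for illustration and are not specified in the text: the selection rule `select_top_k` (ranking by absolute marginal correlation with the response) stands in for the unspecified selection procedure, the combining operator is multiplication, each level is assumed to enlarge the collection by the pairwise combinations of the currently selected functions (distinct pairs only), and a fixed number of levels replaces the convergence check.

```python
import numpy as np

def select_top_k(features, y, k):
    """Hypothetical selection rule: the k features most correlated with y in absolute value."""
    names = list(features)
    scores = [abs(np.corrcoef(features[name], y)[0, 1]) for name in names]
    order = np.argsort(scores)[::-1][:k]
    return [names[i] for i in order]

def hierarchical_selection(X, y, k, levels):
    """Grow the collection level by level, combining only the currently selected features."""
    features = {f"x{j}": X[:, j] for j in range(X.shape[1])}  # initial collection
    for _ in range(levels):
        chosen = select_top_k(features, y, k)
        for i, f in enumerate(chosen):
            for g in chosen[i + 1:]:
                a, b = sorted((f, g))
                # Combine two selected features by multiplication (assumed operator).
                features.setdefault(f"({a}*{b})", features[f] * features[g])
    return select_top_k(features, y, k), len(features)

# Illustrative data with one interaction term.
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 10))
y = X[:, 0] + X[:, 1] + 2 * X[:, 0] * X[:, 1] + 0.1 * rng.standard_normal(2000)

selected, n_features = hierarchical_selection(X, y, k=3, levels=1)
```

Because only the $K$ selected functions are combined at each level, the collection grows by at most $K(K-1)/2$ functions per level rather than by all pairwise combinations of the full collection.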

Bickel, Ritov and Tsybakov consider $f \oplus g = fg$, since they consider
models with interaction. The treelets construction is similar to this one,
with each step yielding two new functions, which result from PCA applied to a
pair of variables. There is one essential difference between our approach and
the treelets algorithm. We also keep at each step the complexity of the
over-determined collection in check, but let the complexity increase with the
levels.